This is our R Notebook, showing the steps we took to complete the Final Project for CS 329E. This notebook includes step-by-step instructions on how to reproduce our project. To obtain our data, we used data.world.
Below we display our sessionInfo().
sessionInfo(package=NULL)
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets grid methods base
other attached packages:
[1] plyr_1.8.4 readr_1.1.0 lubridate_1.6.0 jsonlite_1.4 dplyr_0.5.0
[6] tidyr_0.6.1 reshape2_1.4.2 RCurl_1.95-4.8 bitops_1.0-6 ggplot2_2.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 knitr_1.15.1 magrittr_1.5 hms_0.3 munsell_0.4.3
[6] colorspace_1.3-2 R6_2.2.0 stringr_1.2.0 tools_3.3.3 gtable_0.2.0
[11] DBI_0.6-1 htmltools_0.3.5 lazyeval_0.2.0 assertthat_0.2.0 digest_0.6.12
[16] rprojroot_1.2 tibble_1.3.0 base64enc_0.1-3 evaluate_0.10 rmarkdown_1.4
[21] stringi_1.1.5 backports_1.0.5 scales_0.4.1
The data was found on “Dr. John Rasp’s Statistics Website” (http://www2.stetson.edu/~jrasp/data.htm). It is a subset of the data from College Scorecard, a Department of Education website that publishes data on tuition, costs, and school performance.
An explanatory key for the recorded variables can be found here: https://data.world/jlee/s-17-dv-final-project/file/CollegeScorecard_ColumnNames.pdf
Here’s our ETL file to clean our data set.
source("../01 Data/R_ETL.CollegeScorecard.R")
Parsed with column specification:
cols(
.default = col_character(),
UNITID = col_integer(),
CONTROL = col_integer(),
CCBASIC = col_integer()
)
See spec(...) for full column specifications.
421 parsing failures.
row col expected actual file
7283 CCBASIC an integer NULL '../../CSVs/PreETL_CollegeScorecard.csv'
7284 CCBASIC an integer NULL '../../CSVs/PreETL_CollegeScorecard.csv'
7285 CCBASIC an integer NULL '../../CSVs/PreETL_CollegeScorecard.csv'
7286 CCBASIC an integer NULL '../../CSVs/PreETL_CollegeScorecard.csv'
7287 CCBASIC an integer NULL '../../CSVs/PreETL_CollegeScorecard.csv'
.... ....... .......... ...... ........................................
See problems(...) for more details.
Classes 'tbl_df', 'tbl' and 'data.frame': 7703 obs. of 30 variables:
$ UNITID : int 100654 100663 100690 100706 100724 100751 100760 100812 100830 100858 ...
$ INSTNM : chr "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
$ CITY : chr "Normal" "Birmingham" "Montgomery" "Huntsville" ...
$ STABBR : chr "AL" "AL" "AL" "AL" ...
$ CONTROL : int 1 1 2 1 1 1 1 1 1 1 ...
$ CCBASIC : int 18 15 20 16 19 16 1 22 18 16 ...
$ ADM_RATE : chr "0.5256" "0.8569" "NULL" "0.8203" ...
$ SAT_AVG : chr "827" "1107" "NULL" "1219" ...
$ UGDS : chr "4206" "11383" "291" "5451" ...
$ UGDS_WHITE : chr "0.0333" "0.5922" "0.299" "0.6988" ...
$ UGDS_BLACK : chr "0.9353" "0.26" "0.4192" "0.1255" ...
$ UGDS_HISP : chr "0.0055" "0.0283" "0.0069" "0.0382" ...
$ UGDS_ASIAN : chr "0.0019" "0.0518" "0.0034" "0.0376" ...
$ UGDS_AIAN : chr "0.0024" "0.0022" "0" "0.0143" ...
$ UGDS_NHPI : chr "0.0019" "0.0007" "0" "0.0002" ...
$ UGDS_2MOR : chr "0" "0.0368" "0" "0.0172" ...
$ UGDS_NRA : chr "0.0059" "0.0179" "0" "0.0332" ...
$ UGDS_UNKN : chr "0.0138" "0.01" "0.2715" "0.035" ...
$ PPTUG_EF : chr "0.0656" "0.2607" "0.4536" "0.2146" ...
$ NPT4_PUB : chr "15229" "14789" "NULL" "18596" ...
$ NPT4_PRIV : chr "NULL" "NULL" "12992" "NULL" ...
$ COSTT4_A : chr "21475" "20621" "16370" "21107" ...
$ TUITFTE : chr "9427" "9899" "12459" "8956" ...
$ INEXPFTE : chr "7437" "17920" "5532" "10211" ...
$ PFTFAC : chr "0.8967" "0.9072" "0.6" "0.6221" ...
$ PCTPELL : chr "0.7356" "0.346" "0.6801" "0.3072" ...
$ C150_4 : chr "0.3525" "0.5554" "0.2222" "0.4614" ...
$ PFTFTUG1_EF: chr "0.8578" "0.5041" "0.5" "0.475" ...
$ RET_FT4 : chr "0.6595" "0.8288" "0" "0.7696" ...
$ PCTFLOAN : chr "0.8284" "0.5214" "0.7795" "0.4596" ...
- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 421 obs. of 5 variables:
..$ row : int 7283 7284 7285 7286 7287 7288 7289 7290 7291 7292 ...
..$ col : chr "CCBASIC" "CCBASIC" "CCBASIC" "CCBASIC" ...
..$ expected: chr "an integer" "an integer" "an integer" "an integer" ...
..$ actual : chr "NULL" "NULL" "NULL" "NULL" ...
..$ file : chr "'../../CSVs/PreETL_CollegeScorecard.csv'" "'../../CSVs/PreETL_CollegeScorecard.csv'" "'../../CSVs/PreETL_CollegeScorecard.csv'" "'../../CSVs/PreETL_CollegeScorecard.csv'" ...
- attr(*, "spec")=List of 2
..$ cols :List of 30
.. ..$ UNITID : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ INSTNM : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ CITY : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ STABBR : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ CONTROL : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ CCBASIC : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ ADM_RATE : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ SAT_AVG : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_WHITE : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_BLACK : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_HISP : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_ASIAN : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_AIAN : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_NHPI : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_2MOR : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_NRA : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_UNKN : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PPTUG_EF : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ NPT4_PUB : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ NPT4_PRIV : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ COSTT4_A : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ TUITFTE : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ INEXPFTE : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PFTFAC : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PCTPELL : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ C150_4 : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PFTFTUG1_EF: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ RET_FT4 : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PCTFLOAN : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
invalid factor level, NA generated
invalid factor level, NA generated
invalid factor level, NA generated
invalid factor level, NA generated
invalid factor level, NA generated
invalid factor level, NA generated
[1] "ADM_RATE"
[1] "SAT_AVG"
invalid factor level, NA generated
[1] "UGDS"
[1] "UGDS_WHITE"
[1] "UGDS_BLACK"
[1] "UGDS_HISP"
[1] "UGDS_ASIAN"
[1] "UGDS_AIAN"
[1] "UGDS_NHPI"
[1] "UGDS_2MOR"
[1] "UGDS_NRA"
[1] "UGDS_UNKN"
[1] "PPTUG_EF"
[1] "NPT4_PUB"
invalid factor level, NA generated
[1] "NPT4_PRIV"
invalid factor level, NA generated
[1] "COSTT4_A"
invalid factor level, NA generated
[1] "TUITFTE"
[1] "INEXPFTE"
[1] "PFTFAC"
[1] "PCTPELL"
[1] "C150_4"
[1] "PFTFTUG1_EF"
invalid factor level, NA generated
[1] "RET_FT4"
[1] "PCTFLOAN"
Classes 'tbl_df', 'tbl' and 'data.frame': 7703 obs. of 30 variables:
$ UNITID : Factor w/ 7703 levels "100654","100663",..: 1 2 3 4 5 6 7 8 9 10 ...
$ INSTNM : Factor w/ 7535 levels "A and W Healthcare Educators",..: 95 6760 242 6761 99 6497 1136 391 408 407 ...
$ CITY : Factor w/ 2542 levels "Aberdeen","Abilene",..: 1559 190 1432 1012 1432 2286 27 94 1432 98 ...
$ STABBR : Factor w/ 59 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
$ CONTROL : Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 1 1 1 1 ...
$ CCBASIC : Factor w/ 34 levels "-2","1","10",..: 11 8 14 9 12 9 2 16 11 9 ...
$ ADM_RATE : num 0.526 0.856 NA 0.82 0.533 ...
$ SAT_AVG : num 827 1107 NA 1210 851 ...
$ UGDS : num 4206 11383 201 5451 4811 ...
$ UGDS_WHITE : num 0.0333 0.5022 0.2 0.6088 0.0158 ...
$ UGDS_BLACK : num 0.0353 0.26 0.4102 0.1255 0.0208 ...
$ UGDS_HISP : num 0.0055 0.0283 0.006 0.0382 0.0121 0.0348 0.0044 0.0101 0.0074 0.0248 ...
$ UGDS_ASIAN : num 0.001 0.0518 0.0034 0.0376 0.001 0.0106 0.0025 0.0053 0.0221 0.0227 ...
$ UGDS_AIAN : num 0.0024 0.0022 0 0.0143 0.001 0.0038 0.0044 0.0157 0.0044 0.0074 ...
$ UGDS_NHPI : num 0.001 0.0007 0 0.0002 0.0006 0 0 0.001 0.0016 0 ...
$ UGDS_2MOR : num 0 0.0368 0 0.0172 0.0008 0.0261 0 0.0174 0.0207 0 ...
$ UGDS_NRA : num 0.005 0.017 0 0.0332 0.0243 0.0268 0 0.0057 0.0307 0.01 ...
$ UGDS_UNKN : num 0.0138 0.01 0.2715 0.035 0.0137 ...
$ PPTUG_EF : num 0.0656 0.2607 0.4536 0.2146 0.0802 ...
$ NPT4_PUB : num 15220 14780 NA 18506 11110 ...
$ NPT4_PRIV : num NA NA 12002 NA NA ...
$ COSTT4_A : num 21475 20621 16370 21107 18184 ...
$ TUITFTE : num 427 800 12450 8056 7733 ...
$ INEXPFTE : num 7437 17020 5532 10211 7618 ...
$ PFTFAC : num 0.8067 0.0072 0.6 0.6221 0.653 ...
$ PCTPELL : num 0.736 0.346 0.68 0.307 0.735 ...
$ C150_4 : num 0.352 0.555 0.222 0.461 0.263 ...
$ PFTFTUG1_EF: num 0.858 0.504 0.5 0.475 0.881 ...
$ RET_FT4 : num 0.65 0.829 0 0.761 0.573 ...
$ PCTFLOAN : num 0.828 0.521 0.77 0.451 0.755 ...
- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 421 obs. of 5 variables:
..$ row : int 7283 7284 7285 7286 7287 7288 7289 7290 7291 7292 ...
..$ col : chr "CCBASIC" "CCBASIC" "CCBASIC" "CCBASIC" ...
..$ expected: chr "an integer" "an integer" "an integer" "an integer" ...
..$ actual : chr "NULL" "NULL" "NULL" "NULL" ...
..$ file : chr "'../../CSVs/PreETL_CollegeScorecard.csv'" "'../../CSVs/PreETL_CollegeScorecard.csv'" "'../../CSVs/PreETL_CollegeScorecard.csv'" "'../../CSVs/PreETL_CollegeScorecard.csv'" ...
- attr(*, "spec")=List of 2
..$ cols :List of 30
.. ..$ UNITID : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ INSTNM : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ CITY : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ STABBR : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ CONTROL : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ CCBASIC : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ ADM_RATE : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ SAT_AVG : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_WHITE : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_BLACK : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_HISP : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_ASIAN : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_AIAN : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_NHPI : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_2MOR : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_NRA : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ UGDS_UNKN : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PPTUG_EF : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ NPT4_PUB : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ NPT4_PRIV : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ COSTT4_A : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ TUITFTE : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ INEXPFTE : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PFTFAC : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PCTPELL : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ C150_4 : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PFTFTUG1_EF: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ RET_FT4 : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PCTFLOAN : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
CREATE TABLE CSVsPreETL_CollegeScorecard (
-- Change table_name to the table name you want.
UNITID varchar2(4000),
INSTNM varchar2(4000),
CITY varchar2(4000),
STABBR varchar2(4000),
CONTROL varchar2(4000),
CCBASIC varchar2(4000),
ADM_RATE number(38,4),
SAT_AVG number(38,4),
UGDS number(38,4),
UGDS_WHITE number(38,4),
UGDS_BLACK number(38,4),
UGDS_HISP number(38,4),
UGDS_ASIAN number(38,4),
UGDS_AIAN number(38,4),
UGDS_NHPI number(38,4),
UGDS_2MOR number(38,4),
UGDS_NRA number(38,4),
UGDS_UNKN number(38,4),
PPTUG_EF number(38,4),
NPT4_PUB number(38,4),
NPT4_PRIV number(38,4),
COSTT4_A number(38,4),
TUITFTE number(38,4),
INEXPFTE number(38,4),
PFTFAC number(38,4),
PCTPELL number(38,4),
C150_4 number(38,4),
PFTFTUG1_EF number(38,4),
RET_FT4 number(38,4),
PCTFLOAN number(38,4)
);
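The parsing failures reported earlier appear to come from the literal string NULL in CCBASIC, and the same placeholder is what keeps columns such as ADM_RATE and SAT_AVG as character on first read. One way to avoid both, sketched here under assumptions, is to declare NULL as a missing-value marker at read time. The sketch uses base R's read.csv on a small inline excerpt so that it runs on its own; the row with UNITID 999999 is hypothetical, and this is not necessarily what R_ETL.CollegeScorecard.R does. readr's read_csv accepts a comparable na argument.

```r
# Small inline excerpt standing in for PreETL_CollegeScorecard.csv.
# The first two rows reuse values shown in the str() output above;
# the last row (UNITID 999999) is hypothetical.
csv_text <- "UNITID,CCBASIC,ADM_RATE,SAT_AVG
100654,18,0.5256,827
100690,20,NULL,NULL
999999,NULL,NULL,NULL"

# Treat the literal string NULL as a missing value while reading,
# so numeric columns are typed as numbers instead of character.
scorecard <- read.csv(text = csv_text,
                      na.strings = c("NULL", "NA", ""),
                      stringsAsFactors = FALSE)

# Inspect the resulting column types.
str(scorecard)

# Count the rows that have no CCBASIC value.
missing_ccbasic <- sum(is.na(scorecard$CCBASIC))
print(missing_ccbasic)
```

With the placeholder handled on input, the later character-to-numeric conversion step is no longer needed for these columns.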
Cleaned data can be downloaded from data.world as a .csv file. Because the dataset is so large, we filtered to only show some rows.
Hosting User: jlee
Database: S17 DV Final Project
Dataset Name: CollegeScorecard.csv
Download Link: https://query.data.world/s/dv5dl8q1jx2qb3d3bd2976b9d
Descriptions: Refer to visualization captions for individual descriptions.
Dataset Column Names: INSTNM - Institution Name; STABBR - State; CONTROL - 1 = Public, 2 = Private nonprofit, 3 = Private for-profit
Boxplot: Average Cost of Attendance for Type of School
These boxplots (Tableau left, Shiny right) show the distribution of the average cost of attendance for each type of school.
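The boxplots group schools by CONTROL, which is stored as a numeric code. A minimal sketch of turning those codes into readable labels before summarising cost is given here. It uses only the first five CONTROL and COSTT4_A values from the str() output above, so the numbers illustrate the mechanics and say nothing about the full dataset; the variable names are our own.

```r
# First five CONTROL codes and COSTT4_A values from the str() output above.
control_codes <- c(1, 1, 2, 1, 1)
cost_of_attendance <- c(21475, 20621, 16370, 21107, 18184)

# Map the numeric codes to the labels given in the column-name key.
school_type <- factor(control_codes,
                      levels = c(1, 2, 3),
                      labels = c("Public",
                                 "Private nonprofit",
                                 "Private for-profit"))

# Median cost per school type; a level with no rows comes back as NA.
median_cost <- tapply(cost_of_attendance, school_type, median)
print(median_cost)

# The same grouping drives a boxplot:
# boxplot(cost_of_attendance ~ school_type)
```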
Histogram: SAT Averages for Universities
These histograms (Tableau left, Shiny right) show the distribution of average SAT scores across universities.
Scatterplot: Instructional Expenditures vs. Net tuition
These scatterplots (Tableau left, Shiny right) explore the correlation between instructional expenditures per full-time equivalent student (INEXPFTE) and net tuition revenue per full-time equivalent student (TUITFTE).
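The relationship the scatterplots show can also be summarised as a single correlation coefficient. The sketch below assumes the two columns are already numeric and uses only the first four pre-ETL INEXPFTE and TUITFTE values printed in the str() output above, so the resulting number is purely illustrative.

```r
# First four INEXPFTE and TUITFTE values from the pre-ETL str() output.
instr_exp_fte <- c(7437, 17920, 5532, 10211)
tuition_rev_fte <- c(9427, 9899, 12459, 8956)

# Pearson correlation; use = "complete.obs" drops any pair containing NA,
# which matters on the full dataset where many values are missing.
r <- cor(instr_exp_fte, tuition_rev_fte, use = "complete.obs")
print(r)
```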
Crosstab 1: Instructional Expenditures / Cost of Attendance
These crosstabs (Tableau left, Shiny right) show the ratio of instructional expenses to the average cost of attendance. Each tile is labeled with the average cost of attendance. A red tile indicates a high ratio, a green tile a medium ratio, and a blue tile a low ratio. From the crosstab, one can see that public schools usually have a higher ratio, while private non-profit schools usually have a medium ratio. Private for-profit schools mostly have a medium to low ratio, with the exception of some high ratios in four states.
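The high, medium, and low tiles amount to binning a ratio. A minimal sketch of that calculation follows, using the first four INEXPFTE and COSTT4_A values from the str() output above. The cut points 0.35 and 0.6 are assumptions chosen for illustration; the thresholds actually used in the Tableau and Shiny crosstabs are not stated in this notebook.

```r
# First four INEXPFTE and COSTT4_A values from the pre-ETL str() output.
instr_exp_fte <- c(7437, 17920, 5532, 10211)
cost_of_attendance <- c(21475, 20621, 16370, 21107)

# Ratio of instructional expenses to average cost of attendance.
expense_ratio <- instr_exp_fte / cost_of_attendance

# Bin the ratio into three bands. The breaks are hypothetical.
ratio_band <- cut(expense_ratio,
                  breaks = c(0, 0.35, 0.6, Inf),
                  labels = c("Low", "Medium", "High"),
                  include.lowest = TRUE)

print(data.frame(expense_ratio = round(expense_ratio, 3), ratio_band))
```

The same pattern applies to the second crosstab by swapping TUITFTE in for INEXPFTE.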
Crosstab 2: Tuition Revenue / Total Cost
These crosstabs (Tableau left, Shiny right) show the ratio of the net tuition revenue per full-time student to the average cost of attendance. A red tile indicates a high ratio, a green tile a medium ratio, and a blue tile a low ratio. From the crosstab, one can see that public schools usually have a medium ratio, while private non-profit schools usually have a medium to high ratio. Private for-profit schools mostly have a high ratio, with the exception of some low and medium ratios in some states.
Map 1: Region Cost of Attendance (Instructional Expenditures / Cost of Attendance)
These maps (Tableau left, Shiny right) demonstrate the distribution of instructional expenditure / cost of attendance ratio across the United States, where darker colors indicate higher ratios.
Map 2: Tuition Revenue to Total Cost
These maps (Tableau left, Shiny right) demonstrate the distribution of tuition revenue / total cost ratio across the United States, where darker colors indicate higher ratios.
Barchart: Instructional Expense per Type of Institution
This bar chart with table calculations (Tableau left, Shiny right) displays the sum of instructional expenses for each control type (public, private non-profit, and private for-profit) in each state. The line shows the average of those sums. The accompanying ID-sets map for the bar chart uses two sets of public schools: High Net Price and Medium Net Price. Net price is the actual amount families pay on average. The dots represent schools in the High Net Price set.
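The bar heights and the reference line described here correspond to a grouped sum followed by an average of the sums. The sketch below shows that calculation in base R on the first four rows from the str() output above (all in Alabama), so it demonstrates the steps only; the data frame and variable names are our own.

```r
# First four rows of STABBR, CONTROL and INEXPFTE from the str() output.
schools <- data.frame(
  STABBR   = c("AL", "AL", "AL", "AL"),
  CONTROL  = c(1, 1, 2, 1),
  INEXPFTE = c(7437, 17920, 5532, 10211)
)

# Sum of instructional expenses for each state and control type
# (one bar per combination).
expense_sums <- aggregate(INEXPFTE ~ STABBR + CONTROL, data = schools, FUN = sum)
print(expense_sums)

# Average of those sums (the reference line across the bars).
average_of_sums <- mean(expense_sums$INEXPFTE)
print(average_of_sums)
```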
Description: Full size static .pngs of the Shiny application, as well as a link to the live published version.
Published Link: https://ehjkim.shinyapps.io/shinyfinal/
Boxplot: Average Cost of Attendance for Type of School
Histogram: SAT Averages for Universities
Scatterplot: Instructional Expenditures vs. Net tuition
Crosstab 1: Instructional Expenditures/Cost of Attendance
Crosstab 2: Tuition Revenue / Total Cost
Map 1: Region Cost of Attendance (Instructional Expenditures / Cost of Attendance)
Map 2: Tuition Revenue to Total Cost
Barchart: Instructional Expense per Type of Institution
Descriptions: Full size static .pngs of the Tableau visualizations. Refer to visualization captions for individual descriptions.
Boxplot: Average Cost of Attendance for Type of School
Histogram: SAT Averages for Universities
Scatterplot: Instructional Expenditures vs. Net tuition
Crosstab 1: Instructional Expenditures/Cost of Attendance
Crosstab 2: Tuition Revenue / Total Cost
Map 1: Region Cost of Attendance (Instructional Expenditures / Cost of Attendance)
Map 2: Tuition Revenue to Total Cost
Barchart: Instructional Expense per Type of Institution